5 - Lag classes¶
This tutorial focuses on the estimation of lag classes. It is one of the most important steps for estimating variograms, maybe the most important one. Lag class generation, or binning, usually receives little attention in the geostatistical literature, mainly because almost always the same method is used: a user-set number of equidistant lag classes is formed, with 0 as lower bound and maxlag as upper bound. Maxlag is often set to the median or the 60th percentile of all pairwise separating distances.
In SciKit-GStat this is also the default behavior, but it is only one of a dozen implemented methods. Thus, we want to shed some light on the other methods here. SciKit-GStat implements methods of two different kinds. The first kind are methods that take a fixed N, the number of lag classes, accessible through the Variogram.n_lags property. These methods are ['even', 'uniform', 'kmeans', 'ward']. The other kind is often used in histogram estimation and applies a (simple) rule to figure out a suitable N itself. Using one of these methods will overwrite the Variogram.n_lags property. These methods are: ['sturges', 'scott', 'fd', 'sqrt', 'doane'].
[1]:
import skgstat as skg
import numpy as np
import pandas as pd
from imageio import imread
import plotly.graph_objects as go
from plotly.subplots import make_subplots
skg.plotting.backend('plotly')
5.1 Sample data¶
Load a data sample and draw n_samples observations from the field. For sampling the field, random coordinates are drawn from a gamma distribution with a fairly high scale, to ensure there are some outliers in the sample. The coordinates are then re-scaled to the shape of the random field and the values are extracted from it.
[2]:
n_samples = 150
CHANNEL = 0
pan = imread('./data/pancake6_500x500.png')
r = pan[:,:,CHANNEL]
# sample coordinates from a gamma distribution
np.random.seed(1312)
c = np.random.gamma(10, 40, size=(n_samples,2))
# rescale to image size
c[:, 0] = c[:, 0] / np.max(c[:, 0] * 1.1) + 0.02
c[:, 1] = c[:, 1] / np.max(c[:, 1] * 1.1) + 0.02
coords = (c * pan.shape[0]).astype(int)
# sample red values
vals = [r[x, y] for x, y in coords]
[3]:
fig = make_subplots(1,2,shared_xaxes=True, shared_yaxes=True)
fig.add_trace(
go.Scatter(x=coords[:,0], y=coords[:,1], mode='markers', marker=dict(color=vals,cmin=0, cmax=255), name='samples'),
row=1, col=1
)
fig.add_trace(go.Heatmap(z=r, name='field'), row=1, col=2)
fig.update_layout(width=900, height=450, template='plotly_white')
5.2 Lag class binning - fixed N¶
Apply different lag class binning methods and visualize their histograms. In this section, the distance matrix of all point pair combinations (NxN) is binned using each method. The plots visualize the histogram of the variogram's distance matrix, not the variogram lag classes themselves.
[4]:
N = 15
# use a nugget
V = skg.Variogram(coords, vals, n_lags=N, use_nugget=True)
5.2.1 default 'even' lag classes¶
The default binning method forms N equidistant lag classes. This is the default behavior and is used in almost all geostatistical publications. It should not be used without a maxlag (as is done in the plot below).
[5]:
# apply binning
bins, _ = skg.binning.even_width_lags(V.distance, N, None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'even'}~~binning$")
)
5.2.2 'uniform' lag classes¶
The uniform method adjusts the lag class widths so that each lag class contains the same number of point pairs. This can be used when there must not be any empty lag classes on small data samples, or when comparable sample sizes are desirable for the semi-variance estimator.
[6]:
# apply binning
bins, _ = skg.binning.uniform_count_lags(V.distance, N, None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'uniform'}~~binning$")
)
5.2.3 'kmeans' lag classes¶
The distance matrix is clustered with the K-Means algorithm, and the centroids are taken as a good guess for lag class centers. Each lag class is then formed by taking half the distance to each sorted neighboring centroid as a bound. This will most likely result in non-equidistant lag classes.
One important note about K-Means clustering is that it is not a deterministic method, as the starting points for clustering are chosen randomly. Thus, the decision was made to seed the random initialization. Therefore, the K-Means implementation in SciKit-GStat is deterministic and will always return the same lag classes for the same distance matrix.
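The bound construction described above can be sketched in plain NumPy. This is a simplified illustration with made-up centroid values, not SciKit-GStat's actual implementation:

```python
import numpy as np

def centers_to_edges(centers, max_dist):
    """Place each upper bin edge halfway between two sorted neighboring
    centroids; the last edge is the maximum distance itself."""
    upper = (centers[:-1] + centers[1:]) / 2
    return np.append(upper, max_dist)

# hypothetical, already sorted 1-D cluster centers on the distance axis
centers = np.array([10.0, 30.0, 70.0, 150.0])
edges = centers_to_edges(centers, max_dist=200.0)
print(edges)  # upper bounds at 20, 50, 110 and the maximum distance 200
```

Because the centroids follow the density of the distance matrix, the resulting classes are narrow where many point pairs lie and wide where pairs are sparse.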
[7]:
# apply binning
bins, _ = skg.binning.kmeans(V.distance, N, None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'K-Means'}~~binning$")
)
5.2.4 'ward' lag classes¶
The other clustering algorithm is a hierarchical, agglomerative clustering algorithm. It groups values together based on their similarity, which is expressed by Ward's criterion. Agglomerative algorithms work iteratively and deterministically: at the first iteration, each value forms a cluster on its own; the two most similar clusters are then merged, one pair at a time, until all clusters are merged or the clustering is interrupted. Here, the clustering is interrupted as soon as the specified number of lag classes is reached. The lag classes are then formed similarly to the K-Means method, by taking the cluster mean or median as center.
Ward's criterion defines the closest cluster as the one that results in the smallest intra-cluster variance for the merged cluster. The main downside is processing speed: you will see a significant slowdown for 'ward', and it should not be used on medium or large datasets.
[8]:
# apply binning
bins, _ = skg.binning.ward(V.distance, N, None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'ward'}~~binning$")
)
5.3 Lag class binning - adjustable N¶
5.3.1 'sturges' lag classes¶
Sturges' rule is well known and pretty straightforward. It is the default method for histograms in R. The number of equidistant lag classes is defined as:
\[ n = \log_2 N + 1 \]
where \(N\) is the number of pairwise distances. Sturges' rule works well for small, normally distributed datasets.
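As a quick sanity check, the rule can be evaluated directly in NumPy for the 150 points of this sample (a sketch assuming the common ceiling rounding, not SciKit-GStat's internal code):

```python
import numpy as np

n_points = 150
n_dists = n_points * (n_points - 1) // 2        # pairwise distances: 11175
n_classes = int(np.ceil(np.log2(n_dists) + 1))  # Sturges' rule
print(n_classes)  # -> 15
```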
[9]:
# apply binning
bins, n = skg.binning.auto_derived_lags(V.distance, 'sturges', None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'sturges'}~~binning~~%d~classes$" % n)
)
5.3.2 'scott' lag classes¶
Scott's rule is another quite popular approach to estimate histograms. The rule defines the lag class width as:
\[ h = \frac{3.5\,s}{\sqrt[3]{N}} \]
where \(s\) is the sample standard deviation and \(N\) the number of pairwise distances. Other than Sturges' rule, it estimates the lag class width from the sample standard deviation. Thus, it is also quite sensitive to outliers.
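A minimal sketch of the width computation on synthetic, gamma-distributed stand-in distances (all values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
dists = rng.gamma(5.0, 30.0, size=10_000)  # stand-in for a distance matrix

# Scott's rule: bin width from the sample standard deviation
h = 3.5 * dists.std() / np.power(dists.size, 1 / 3)
n_classes = int(np.ceil((dists.max() - dists.min()) / h))
print(h, n_classes)
```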
[10]:
# apply binning
bins, n = skg.binning.auto_derived_lags(V.distance, 'scott', None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'scott'}~~binning~~%d~classes$" % n)
)
5.3.3 'sqrt' lag classes¶
The only advantage of this method is its speed. The number of lag classes is simply defined as:
\[ n = \sqrt{N} \]
Thus, it is usually not a good choice, unless you have a lot of samples.
[11]:
# apply binning
bins, n = skg.binning.auto_derived_lags(V.distance, 'sqrt', None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'sqrt'}~~binning~~%d~classes$" % n)
)
5.3.4 'fd' lag classes¶
The Freedman-Diaconis estimator again derives the number of lag classes from an optimal lag class width:
\[ h = 2\,\frac{IQR}{\sqrt[3]{N}} \]
As it is based on the interquartile range (IQR), it is very robust to outliers. That makes it a suitable method to estimate lag classes on non-normal distance matrices. On the other hand, it usually over-estimates \(n\) for small datasets, so it should only be used on medium to large datasets.
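The robustness claim can be illustrated on synthetic data with a few injected outliers (all numbers made up): the IQR-based Freedman-Diaconis width barely reacts to the contamination, while a std-based width such as Scott's inflates considerably.

```python
import numpy as np

rng = np.random.default_rng(0)
dists = np.concatenate([
    rng.normal(100, 20, 5_000),   # bulk of the distances
    rng.normal(400, 10, 50),      # a handful of far outliers
])

q75, q25 = np.percentile(dists, [75, 25])
h_fd = 2 * (q75 - q25) / np.power(dists.size, 1 / 3)       # IQR-based
h_scott = 3.5 * dists.std() / np.power(dists.size, 1 / 3)  # std-based
print(h_fd, h_scott)  # the std-based width is inflated by the outliers
```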
[12]:
# apply binning
bins, n = skg.binning.auto_derived_lags(V.distance, 'fd', None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'fd'}~~binning~~%d~classes$" % n)
)
5.3.5 'doane' lag classes¶
Doane's rule is an extension of Sturges' rule that takes the skewness of the distance matrix into account. It was found to be a very reasonable choice on most datasets where the other estimators did not yield good results.
It is defined as:
\[ n = 1 + \log_2 N + \log_2\left(1 + \frac{|g_1|}{\sigma_{g_1}}\right), \quad \sigma_{g_1} = \sqrt{\frac{6\,(N - 2)}{(N + 1)(N + 3)}} \]
where \(g_1\) is the estimated skewness of the distance matrix.
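The skewness correction can be sketched with a small hypothetical helper in plain NumPy (not SciKit-GStat's internal code); a strongly skewed sample receives noticeably more classes than a symmetric one of the same size:

```python
import numpy as np

def doane_n(x):
    """Doane's rule: Sturges' rule plus a skewness correction term."""
    N = x.size
    g1 = np.mean((x - x.mean()) ** 3) / x.std() ** 3       # sample skewness
    sigma_g1 = np.sqrt(6 * (N - 2) / ((N + 1) * (N + 3)))  # its standard error
    return int(np.ceil(1 + np.log2(N) + np.log2(1 + abs(g1) / sigma_g1)))

rng = np.random.default_rng(7)
symmetric = rng.normal(size=1_000)
skewed = rng.gamma(1.0, size=1_000)  # strongly right-skewed
print(doane_n(symmetric), doane_n(skewed))
```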
[13]:
# apply binning
bins, n = skg.binning.auto_derived_lags(V.distance, 'doane', None)
# get the histogram
count, _ = np.histogram(V.distance, bins=bins)
go.Figure(
go.Bar(x=bins, y=count),
layout=dict(template='plotly_white', title=r"$\texttt{'doane'}~~binning~~%d~classes$" % n)
)
5.4 Variograms¶
The following section gives an overview of the influence of the chosen binning method on the resulting variogram. All other parameters are the same for all variograms, so any change is due to the lag class binning. The variogram uses a maximum lag of 200 to get rid of the very thin last bins at large distances.
The maxlag is very close to the effective range of the variogram, thus you can only see differences in the sill. But variogram fitting is not the focus of this tutorial. You can also change the parameters and fit a more suitable spatial model.
[14]:
# use a spherical model
V.set_model('spherical')
# set the maxlag
V.maxlag = 200
5.4.1 'even' lag classes¶
[15]:
# set the new binning method
V.bin_func = 'even'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"even" - range: 200.0 sill: 627.1
5.4.2 'uniform' lag classes¶
[16]:
# set the new binning method
V.bin_func = 'uniform'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"uniform" - range: 200.0 sill: 551.2
5.4.3 'kmeans' lag classes¶
[17]:
# set the new binning method
V.bin_func = 'kmeans'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"kmeans" - range: 184.0 sill: 579.0
5.4.4 'ward' lag classes¶
[18]:
# set the new binning method
V.bin_func = 'ward'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"ward" - range: 181.8 sill: 569.0
5.4.5 'sturges' lag classes¶
[19]:
# set the new binning method
V.bin_func = 'sturges'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"sturges" adjusted 15 lag classes - range: 200.0 sill: 624.1
5.4.6 'scott' lag classes¶
[20]:
# set the new binning method
V.bin_func = 'scott'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"scott" adjusted 27 lag classes - range: 200.0 sill: 621.1
5.4.7 'fd' lag classes¶
[21]:
# set the new binning method
V.bin_func = 'fd'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"fd" adjusted 31 lag classes - range: 200.0 sill: 616.6
5.4.8 'sqrt' lag classes¶
[22]:
# set the new binning method
V.bin_func = 'sqrt'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"sqrt" adjusted 100 lag classes - range: 200.0 sill: 614.9
5.4.9 'doane' lag classes¶
[23]:
# set the new binning method
V.bin_func = 'doane'
# plot
fig = V.plot(show=False)
print(f'"{V._bin_func_name}" adjusted {V.n_lags} lag classes - range: {np.round(V.cof[0], 1)} sill: {np.round(V.cof[1], 1)}')
fig.update_layout(template='plotly_white')
"doane" adjusted 18 lag classes - range: 200.0 sill: 618.7